{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using custom containers with AI Platform Training\n",
    "\n",
    "**Learning Objectives:**\n",
    "1. Learn how to create a train and a validation split with BigQuery\n",
    "1. Learn how to wrap a machine learning model into a Docker container and train in on AI Platform\n",
    "1. Learn how to use the hyperparameter tunning engine on Google Cloud to find the best hyperparameters\n",
    "1. Learn how to deploy a trained machine learning model Google Cloud as a rest API and query it\n",
    "\n",
    "In this lab, you develop a  multi-class classification model, package the model as a docker image, and run on **AI Platform Training** as a training application. The training application trains a multi-class classification model that predicts the type of forest cover from cartographic data. The [dataset](../../../datasets/covertype/README.md) used in the lab is based on **Covertype Data Set** from UCI Machine Learning Repository.\n",
    "\n",
    "Scikit-learn is one of the most useful libraries for machine learning in Python. The training code uses `scikit-learn` for data pre-processing and modeling. \n",
    "\n",
    "The code is instrumented using the `hypertune` package so it can be used with **AI Platform** hyperparameter tuning job in searching for the best combination of hyperparameter values by optimizing the metrics you specified.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import os\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import pickle\n",
    "import uuid\n",
    "import time\n",
    "import tempfile\n",
    "\n",
    "from googleapiclient import discovery\n",
    "from googleapiclient import errors\n",
    "\n",
    "from google.cloud import bigquery\n",
    "from jinja2 import Template\n",
    "from kfp.components import func_to_container_op\n",
    "from typing import NamedTuple\n",
    "\n",
    "from sklearn.metrics import accuracy_score\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.linear_model import SGDClassifier\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
    "from sklearn.compose import ColumnTransformer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the command in the cell below to install gcsfs package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install gcsfs==0.8"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare lab dataset\n",
    "\n",
    "Set environment variable so that we can use them throughout the entire lab.\n",
    "\n",
    "The pipeline ingests data from BigQuery. The cell below uploads the Covertype dataset to BigQuery.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PROJECT_ID = !(gcloud config get-value core/project)\n",
    "PROJECT_ID = PROJECT_ID[0]\n",
    "DATASET_ID='covertype_dataset'\n",
    "DATASET_LOCATION='US'\n",
    "TABLE_ID='covertype'\n",
    "DATA_SOURCE='gs://cloud-training/OCBL203/workshop-datasets/dataset.csv'\n",
    "SCHEMA='Elevation:INTEGER,Aspect:INTEGER,Slope:INTEGER,Horizontal_Distance_To_Hydrology:INTEGER,Vertical_Distance_To_Hydrology:INTEGER,Horizontal_Distance_To_Roadways:INTEGER,Hillshade_9am:INTEGER,Hillshade_Noon:INTEGER,Hillshade_3pm:INTEGER,Horizontal_Distance_To_Fire_Points:INTEGER,Wilderness_Area:STRING,Soil_Type:STRING,Cover_Type:INTEGER'\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, create the BigQuery dataset and upload the Covertype csv data into a table.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!bq --location=$DATASET_LOCATION --project_id=$PROJECT_ID mk --dataset $DATASET_ID\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!bq --project_id=$PROJECT_ID --dataset_id=$DATASET_ID load \\\n",
    "--source_format=CSV \\\n",
    "--skip_leading_rows=1 \\\n",
    "--replace \\\n",
    "$TABLE_ID \\\n",
    "$DATA_SOURCE \\\n",
    "$SCHEMA\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Configure environment settings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set location paths, connections strings, and other environment settings. Make sure to update   `REGION`, and `ARTIFACT_STORE`  with the settings reflecting your lab environment. \n",
    "\n",
    "- `REGION` - the compute region for AI Platform Training and Prediction\n",
    "- `ARTIFACT_STORE` - the Cloud Storage bucket created during installation of AI Platform Pipelines. The bucket name starts with the `qwiklabs-gcp-xx-xxxxxxx-kubeflowpipelines-default` prefix.\n",
    "\n",
    " Run gsutil ls without URLs to list all of the Cloud Storage buckets under your default project ID. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!gsutil ls"
   ]
  },
  {
   "source": [
    "**HINT:** For ARTIFACT_STORE, copy the bucket name which starts with the qwiklabs-gcp-xx-xxxxxxx-kubeflowpipelines-default prefix from the previous cell output. \n",
    "\n",
    "Your copied value should look like 'gs://qwiklabs-gcp-xx-xxxxxxx-kubeflowpipelines-default')."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "REGION = 'us-central1'\n",
    "ARTIFACT_STORE = 'gs://qwiklabs-gcp-xx-xxxxxxx-kubeflowpipelines-default' # TO DO: REPLACE WITH YOUR ARTIFACT_STORE NAME\n",
    "\n",
    "PROJECT_ID = !(gcloud config get-value core/project)\n",
    "PROJECT_ID = PROJECT_ID[0]\n",
    "DATA_ROOT='{}/data'.format(ARTIFACT_STORE)\n",
    "JOB_DIR_ROOT='{}/jobs'.format(ARTIFACT_STORE)\n",
    "TRAINING_FILE_PATH='{}/{}/{}'.format(DATA_ROOT, 'training', 'dataset.csv')\n",
    "VALIDATION_FILE_PATH='{}/{}/{}'.format(DATA_ROOT, 'validation', 'dataset.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Explore the Covertype dataset \n",
    "\n",
    "Run the query statement below to scan covertype_dataset.covertype table in BigQuery and return the computed result rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery\n",
    "SELECT *\n",
    "FROM `covertype_dataset.covertype`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create training and validation splits\n",
    "\n",
    "Use BigQuery to sample training and validation splits and save them to Cloud Storage.\n",
    "### Create a training split\n",
    "\n",
    "Run the query below in order to have repeatable sampling of the data in BigQuery. Note that `FARM_FINGERPRINT()` is used on the field that you are going to split your data. It creates a training split that takes 80% of the data using the `bq` command and exports this split into the BigQuery table of `covertype_dataset.training`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!bq query \\\n",
    "-n 0 \\\n",
    "--destination_table covertype_dataset.training \\\n",
    "--replace \\\n",
    "--use_legacy_sql=false \\\n",
    "'SELECT * \\\n",
    "FROM `covertype_dataset.covertype` AS cover \\\n",
    "WHERE \\\n",
    "MOD(ABS(FARM_FINGERPRINT(TO_JSON_STRING(cover))), 10) IN (1, 2, 3, 4)' "
   ]
  },
  {
   "source": [
    "Use the bq extract command to export the BigQuery training table to GCS at $TRAINING_FILE_PATH."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!bq extract \\\n",
    "--destination_format CSV \\\n",
    "covertype_dataset.training \\\n",
    "$TRAINING_FILE_PATH"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a validation split\n",
    "\n",
    "Run the query below to create a validation split that takes 10% of the data using the `bq` command and export this split into the BigQuery table `covertype_dataset.validation`.\n",
    "\n",
    "In the second cell, use the `bq` command to export that BigQuery validation table to GCS at `$VALIDATION_FILE_PATH.`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!bq query \\\n",
    "-n 0 \\\n",
    "--destination_table covertype_dataset.validation \\\n",
    "--replace \\\n",
    "--use_legacy_sql=false \\\n",
    "'SELECT * \\\n",
    "FROM `covertype_dataset.covertype` AS cover \\\n",
    "WHERE \\\n",
    "MOD(ABS(FARM_FINGERPRINT(TO_JSON_STRING(cover))), 10) IN (8)' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!bq extract \\\n",
    "--destination_format CSV \\\n",
    "covertype_dataset.validation \\\n",
    "$VALIDATION_FILE_PATH"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_train = pd.read_csv(TRAINING_FILE_PATH)\n",
    "df_validation = pd.read_csv(VALIDATION_FILE_PATH)\n",
    "print(df_train.shape)\n",
    "print(df_validation.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Develop a training application"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Configure the `sklearn` training pipeline.\n",
    "\n",
    "The training pipeline preprocesses data by standardizing all numeric features using `sklearn.preprocessing.StandardScaler` and encoding all categorical features using `sklearn.preprocessing.OneHotEncoder`. It uses stochastic gradient descent linear classifier (`SGDClassifier`) for modeling."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "numeric_feature_indexes = slice(0, 10)\n",
    "categorical_feature_indexes = slice(10, 12)\n",
    "\n",
    "preprocessor = ColumnTransformer(\n",
    "    transformers=[\n",
    "        ('num', StandardScaler(), numeric_feature_indexes),\n",
    "        ('cat', OneHotEncoder(), categorical_feature_indexes) \n",
    "    ])\n",
    "\n",
    "pipeline = Pipeline([\n",
    "    ('preprocessor', preprocessor),\n",
    "    ('classifier', SGDClassifier(loss='log', tol=1e-3))\n",
    "])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Convert all numeric features to `float64`\n",
    "\n",
    "To avoid warning messages from `StandardScaler` all numeric features are converted to `float64`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_features_type_map = {feature: 'float64' for feature in df_train.columns[numeric_feature_indexes]}\n",
    "\n",
    "df_train = df_train.astype(num_features_type_map)\n",
    "df_validation = df_validation.astype(num_features_type_map)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run the pipeline locally."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train = df_train.drop('Cover_Type', axis=1)\n",
    "y_train = df_train['Cover_Type']\n",
    "X_validation = df_validation.drop('Cover_Type', axis=1)\n",
    "y_validation = df_validation['Cover_Type']\n",
    "\n",
    "pipeline.set_params(classifier__alpha=0.001, classifier__max_iter=200)\n",
    "pipeline.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Calculate the trained model's accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "accuracy = pipeline.score(X_validation, y_validation)\n",
    "print(accuracy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prepare the hyperparameter tuning application.\n",
    "Since the training run on this dataset is computationally expensive you can benefit from running a distributed hyperparameter tuning job on AI Platform Training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "TRAINING_APP_FOLDER = 'training_app'\n",
    "os.makedirs(TRAINING_APP_FOLDER, exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Write the tuning script. \n",
    "\n",
    "Notice the use of the `hypertune` package to report the `accuracy` optimization metric to AI Platform hyperparameter tuning service."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile {TRAINING_APP_FOLDER}/train.py\n",
    "\n",
    "# Copyright 2019 Google Inc. All Rights Reserved.\n",
    "#\n",
    "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "#            http://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License.\n",
    "\n",
    "import os\n",
    "import subprocess\n",
    "import sys\n",
    "\n",
    "import fire\n",
    "import pickle\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "import hypertune\n",
    "\n",
    "from sklearn.compose import ColumnTransformer\n",
    "from sklearn.linear_model import SGDClassifier\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
    "\n",
    "\n",
    "def train_evaluate(job_dir, training_dataset_path, validation_dataset_path, alpha, max_iter, hptune):\n",
    "    \n",
    "    df_train = pd.read_csv(training_dataset_path)\n",
    "    df_validation = pd.read_csv(validation_dataset_path)\n",
    "\n",
    "    if not hptune:\n",
    "        df_train = pd.concat([df_train, df_validation])\n",
    "\n",
    "    numeric_feature_indexes = slice(0, 10)\n",
    "    categorical_feature_indexes = slice(10, 12)\n",
    "\n",
    "    preprocessor = ColumnTransformer(\n",
    "    transformers=[\n",
    "        ('num', StandardScaler(), numeric_feature_indexes),\n",
    "        ('cat', OneHotEncoder(), categorical_feature_indexes) \n",
    "    ])\n",
    "\n",
    "    pipeline = Pipeline([\n",
    "        ('preprocessor', preprocessor),\n",
    "        ('classifier', SGDClassifier(loss='log',tol=1e-3))\n",
    "    ])\n",
    "\n",
    "    num_features_type_map = {feature: 'float64' for feature in df_train.columns[numeric_feature_indexes]}\n",
    "    df_train = df_train.astype(num_features_type_map)\n",
    "    df_validation = df_validation.astype(num_features_type_map) \n",
    "\n",
    "    print('Starting training: alpha={}, max_iter={}'.format(alpha, max_iter))\n",
    "    X_train = df_train.drop('Cover_Type', axis=1)\n",
    "    y_train = df_train['Cover_Type']\n",
    "\n",
    "    pipeline.set_params(classifier__alpha=alpha, classifier__max_iter=max_iter)\n",
    "    pipeline.fit(X_train, y_train)\n",
    "\n",
    "    if hptune:\n",
    "        X_validation = df_validation.drop('Cover_Type', axis=1)\n",
    "        y_validation = df_validation['Cover_Type']\n",
    "        accuracy = pipeline.score(X_validation, y_validation)\n",
    "        print('Model accuracy: {}'.format(accuracy))\n",
    "        # Log it with hypertune\n",
    "        hpt = hypertune.HyperTune()\n",
    "        hpt.report_hyperparameter_tuning_metric(\n",
    "          hyperparameter_metric_tag='accuracy',\n",
    "          metric_value=accuracy\n",
    "        )\n",
    "\n",
    "    # Save the model\n",
    "    if not hptune:\n",
    "        model_filename = 'model.pkl'\n",
    "        with open(model_filename, 'wb') as model_file:\n",
    "            pickle.dump(pipeline, model_file)\n",
    "        gcs_model_path = \"{}/{}\".format(job_dir, model_filename)\n",
    "        subprocess.check_call(['gsutil', 'cp', model_filename, gcs_model_path], stderr=sys.stdout)\n",
    "        print(\"Saved model in: {}\".format(gcs_model_path)) \n",
    "    \n",
    "if __name__ == \"__main__\":\n",
    "    fire.Fire(train_evaluate)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Package the script into a docker image.\n",
    "\n",
    "Notice that we are installing specific versions of `scikit-learn` and `pandas` in the training image. This is done to make sure that the training runtime is aligned with the serving runtime. Later in the notebook you will deploy the model to AI Platform Prediction, using the 1.15 version of AI Platform Prediction runtime. \n",
    "\n",
    "Make sure to update the URI for the base image so that it points to your project's **Container Registry**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile {TRAINING_APP_FOLDER}/Dockerfile\n",
    "\n",
    "FROM gcr.io/deeplearning-platform-release/base-cpu\n",
    "RUN pip install -U fire cloudml-hypertune scikit-learn==0.20.4 pandas==0.24.2\n",
    "WORKDIR /app\n",
    "COPY train.py .\n",
    "\n",
    "ENTRYPOINT [\"python\", \"train.py\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Build the docker image. \n",
    "\n",
    "You use **Cloud Build** to build the image and push it your project's **Container Registry**. As you use the remote cloud service to build the image, you don't need a local installation of Docker."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "IMAGE_NAME='trainer_image'\n",
    "IMAGE_TAG='latest'\n",
    "IMAGE_URI='gcr.io/{}/{}:{}'.format(PROJECT_ID, IMAGE_NAME, IMAGE_TAG)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!gcloud builds submit --tag $IMAGE_URI $TRAINING_APP_FOLDER"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Submit an AI Platform hyperparameter tuning job"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create the hyperparameter configuration file. \n",
    "Recall that the training code uses `SGDClassifier`. The training application has been designed to accept two hyperparameters that control `SGDClassifier`:\n",
    "- Max iterations\n",
    "- Alpha\n",
    "\n",
    "The below file configures AI Platform hypertuning to run up to 6 trials on up to three nodes and to choose from two discrete values of `max_iter` and the linear range betwee 0.00001 and 0.001 for `alpha`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile {TRAINING_APP_FOLDER}/hptuning_config.yaml\n",
    "\n",
    "# Copyright 2019 Google Inc. All Rights Reserved.\n",
    "#\n",
    "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "#            http://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License.\n",
    "\n",
    "trainingInput:\n",
    "  hyperparameters:\n",
    "    goal: MAXIMIZE\n",
    "    maxTrials: 4\n",
    "    maxParallelTrials: 4\n",
    "    hyperparameterMetricTag: accuracy\n",
    "    enableTrialEarlyStopping: TRUE \n",
    "    params:\n",
    "    - parameterName: max_iter\n",
    "      type: DISCRETE\n",
    "      discreteValues: [\n",
    "          200,\n",
    "          500\n",
    "          ]\n",
    "    - parameterName: alpha\n",
    "      type: DOUBLE\n",
    "      minValue:  0.00001\n",
    "      maxValue:  0.001\n",
    "      scaleType: UNIT_LINEAR_SCALE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Start the hyperparameter tuning job.\n",
    "\n",
    "Use the `gcloud` command to start the hyperparameter tuning job."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "JOB_NAME = \"JOB_{}\".format(time.strftime(\"%Y%m%d_%H%M%S\"))\n",
    "JOB_DIR = \"{}/{}\".format(JOB_DIR_ROOT, JOB_NAME)\n",
    "SCALE_TIER = \"BASIC\"\n",
    "\n",
    "!gcloud ai-platform jobs submit training $JOB_NAME \\\n",
    "--region=$REGION \\\n",
    "--job-dir=$JOB_DIR \\\n",
    "--master-image-uri=$IMAGE_URI \\\n",
    "--scale-tier=$SCALE_TIER \\\n",
    "--config $TRAINING_APP_FOLDER/hptuning_config.yaml \\\n",
    "-- \\\n",
    "--training_dataset_path=$TRAINING_FILE_PATH \\\n",
    "--validation_dataset_path=$VALIDATION_FILE_PATH \\\n",
    "--hptune"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Monitor the job.\n",
    "\n",
    "You can monitor the job using Google Cloud console or from within the notebook using `gcloud` commands."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!gcloud ai-platform jobs describe $JOB_NAME"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!gcloud ai-platform jobs stream-logs $JOB_NAME"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**NOTE: The above AI platform job stream logs will take approximately 5~10 minutes to display.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Retrieve HP-tuning results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After the job completes you can review the results using Google Cloud Console or programatically by calling the AI Platform Training REST end-point."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ml = discovery.build('ml', 'v1')\n",
    "\n",
    "job_id = 'projects/{}/jobs/{}'.format(PROJECT_ID, JOB_NAME)\n",
    "request = ml.projects().jobs().get(name=job_id)\n",
    "\n",
    "try:\n",
    "    response = request.execute()\n",
    "except errors.HttpError as err:\n",
    "    print(err)\n",
    "except:\n",
    "    print(\"Unexpected error\")\n",
    "    \n",
    "response"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The returned run results are sorted by a value of the optimization metric. The best run is the first item on the returned list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response['trainingOutput']['trials'][0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Retrain the model with the best hyperparameters\n",
    "\n",
    "You can now retrain the model using the best hyperparameters and using combined training and validation splits as a training dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Configure and run the training job"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "alpha = response['trainingOutput']['trials'][0]['hyperparameters']['alpha']\n",
    "max_iter = response['trainingOutput']['trials'][0]['hyperparameters']['max_iter']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "JOB_NAME = \"JOB_{}\".format(time.strftime(\"%Y%m%d_%H%M%S\"))\n",
    "JOB_DIR = \"{}/{}\".format(JOB_DIR_ROOT, JOB_NAME)\n",
    "SCALE_TIER = \"BASIC\"\n",
    "\n",
    "!gcloud ai-platform jobs submit training $JOB_NAME \\\n",
    "--region=$REGION \\\n",
    "--job-dir=$JOB_DIR \\\n",
    "--master-image-uri=$IMAGE_URI \\\n",
    "--scale-tier=$SCALE_TIER \\\n",
    "-- \\\n",
    "--training_dataset_path=$TRAINING_FILE_PATH \\\n",
    "--validation_dataset_path=$VALIDATION_FILE_PATH \\\n",
    "--alpha=$alpha \\\n",
    "--max_iter=$max_iter \\\n",
    "--nohptune"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!gcloud ai-platform jobs stream-logs $JOB_NAME"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**NOTE: The above AI platform job stream logs will take approximately 5~10 minutes to display.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Examine the training output\n",
    "\n",
    "The training script saved the trained model as the 'model.pkl' in the `JOB_DIR` folder on Cloud Storage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!gsutil ls $JOB_DIR"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Deploy the model to AI Platform Prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a model resource\n",
    "\n",
    "Use the gcloud command to create a model with model_name in $REGION tagged with labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_name = 'forest_cover_classifier'\n",
    "labels = \"task=classifier,domain=forestry\"\n",
    "\n",
    "!gcloud ai-platform models create  $model_name \\\n",
    " --regions=$REGION \\\n",
    " --labels=$labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a model version\n",
    "\n",
    "Use the gcloud command to create a version of the model.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_version = 'v01'\n",
    "\n",
    "!gcloud ai-platform versions create {model_version} \\\n",
    " --model={model_name} \\\n",
    " --origin=$JOB_DIR \\\n",
    " --runtime-version=1.15 \\\n",
    " --framework=scikit-learn \\\n",
    " --python-version=3.7\\\n",
    " --region global"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Serve predictions\n",
    "#### Prepare the input file with JSON formated instances."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_file = 'serving_instances.json'\n",
    "\n",
    "with open(input_file, 'w') as f:\n",
    "    for index, row in X_validation.head().iterrows():\n",
    "        f.write(json.dumps(list(row.values)))\n",
    "        f.write('\\n')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat $input_file"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Invoke the model\n",
    "\n",
    "Use the gcloud command send the data in $input_file to your model deployed as a REST API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!gcloud ai-platform predict \\\n",
    "--model $model_name \\\n",
    "--version $model_version \\\n",
    "--json-instances $input_file\\\n",
    "--region global"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font size=-1>Licensed under the Apache License, Version 2.0 (the \\\"License\\\");\n",
    "you may not use this file except in compliance with the License.\n",
    "You may obtain a copy of the License at [https://www.apache.org/licenses/LICENSE-2.0](https://www.apache.org/licenses/LICENSE-2.0)\n",
    "\n",
    "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \\\"AS IS\\\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.</font>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
